TIPS
Write commands in the code editor of RStudio and run them using the Run icon in RStudio.
We will download files containing gene counts per sample for an example RNA-seq project, and combine them into a single counts table. We’ll also prepare a metadata file for this project.
First, download the zip file of counts from this URL:
https://wd.cri.uic.edu/edgeR/featurecounts.zip
This file will, by default, be saved to your Downloads folder. Use your system’s file explorer to navigate to this folder.
Once you find the featurecounts.zip file, you will need to unzip it.
Six files should be extracted, all generated by the featureCounts program. Each file has gene expression counts for one sample. Here is a preview of the first one (note that long fields are truncated):
# Program:featureCounts v2.0.3; Command:"featureCounts" "-t" "exon" "-g" "gene_id" "-p"
"-T" "1" "-s" "1" "-a" "mm9.gtf" "-o" "A.1" "A.1.bam"
Geneid Chr Start End Strand Length A.1.bam
Xkr4 chr1;chr... 3204563;... 3207049;... -;-;- 3634 0
Rp1 chr1;chr... 4280927;... 4283093;... -;-;-;-;... 9747 7
Sox17 chr1;chr... 4481009;... 4482749;... -;-;-;-;... 3130 229
Mrpl15 chr1;chr... 4763279;... 4764597;... -;-;-;-;... 4203 80
Lypla1 chr1;chr... 4797974;... 4798063;... +;+;+;+;... 2433 184
Tcea1 chr1;chr... 4847775;... 4848057;... +;+;+;+;... 2847 108
Rgs20 chr1;chr... 4899657;... 4900743;... -;-;-;-;... 2241 0
Atp6v1h chr1;chr... 5073254;... 5073359;... +;+;+;+;... 1976 233
There’s a comment line (the first line, starting with #) giving the command used to generate this file. After that, the fields in the table give the gene ID, all exon coordinates and strands (Chr, Start, End, and Strand), the total gene length in bp, and the read counts for sample A.1. The read counts (last column) are the data we want from each sample, along with the gene IDs.
What we need to do is read each file, keep its counts column, and combine those columns into a single table. The steps below assume that the genes are in the same order in each file (we use cbind() without checking that the gene names match). In practice this is true for featureCounts output, but it is also something that is good to double-check in general.
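As a hedged illustration of that double-check (toy data, not the real featureCounts files), a sanity check on a list of loaded tables could look like this:

```r
# Toy stand-ins for two loaded count tables (hypothetical values).
a <- data.frame(A.1.bam = c(0, 7), row.names = c("Xkr4", "Rp1"))
b <- data.frame(A.2.bam = c(2, 9), row.names = c("Xkr4", "Rp1"))
toy.list <- list(a, b)

# TRUE only if every table has the same genes in the same order.
same.order <- all(sapply(toy.list, function(x)
  identical(rownames(x), rownames(toy.list[[1]]))))
same.order  # TRUE
```

The same pattern works on the real list of data frames once they are loaded.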
This assumes you’ve downloaded all of the files to your “Downloads” folder and unpacked them there.
On Windows:
fc.files <- list.files("~/../Downloads/featurecounts",
pattern="*.counts.txt", full.names=T)
fc.files
[1] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/A.1.counts.txt"
[2] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/A.2.counts.txt"
[3] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/A.3.counts.txt"
[4] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/B.1.counts.txt"
[5] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/B.2.counts.txt"
[6] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/B.3.counts.txt"
On Mac or Linux:
fc.files <- list.files("~/Downloads/featurecounts",
pattern="*.counts.txt", full.names=T)
fc.files
[1] "/home/jasonmw/Downloads/featurecounts/A.1.counts.txt"
[2] "/home/jasonmw/Downloads/featurecounts/A.2.counts.txt"
[3] "/home/jasonmw/Downloads/featurecounts/A.3.counts.txt"
[4] "/home/jasonmw/Downloads/featurecounts/B.1.counts.txt"
[5] "/home/jasonmw/Downloads/featurecounts/B.2.counts.txt"
[6] "/home/jasonmw/Downloads/featurecounts/B.3.counts.txt"
Note that the file paths will reflect where the files are on YOUR computer.
We will use the lapply() function to have read.delim() read each of these files into R and return a list containing the data. We need to tell read.delim() to ignore comment lines (lines starting with #). We also need to subset the data frame read from each file to just column 6 (gene counts); this was column 7 in the file, but the numbers shifted when we set the first column as row names. Finally, we need to set drop=F to keep the result structured as a data frame; otherwise it would become a vector and we would lose the row names.
fc.data <- lapply(fc.files,
function(x) { read.delim(x, row.names=1, comment.char="#")[, 6, drop=F] })
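The effect of drop=F can be seen on a tiny one-column data frame (toy values, just for illustration):

```r
df <- data.frame(n = 1:2, row.names = c("g1", "g2"))
class(df[, 1])                # "integer" - collapsed to a vector, row names lost
class(df[, 1, drop = FALSE])  # "data.frame" - still a data frame with row names
```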
Now that we have all of the data loaded into a list, we will use the Reduce() function combined with the cbind() function to combine the data into a single data frame. The Reduce() function will iteratively apply a function to a list (note that we could have done this with a for loop instead). We can use cbind() because we know that all of the data tables loaded have the same rows.
counts.table <- Reduce(cbind, fc.data)
head(counts.table, n=3)
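If Reduce() is new to you, here is a minimal sketch on toy one-column data frames:

```r
# Reduce(cbind, list(a, b, c)) is equivalent to cbind(cbind(a, b), c).
dfs <- list(data.frame(a = 1:2), data.frame(b = 3:4), data.frame(c = 5:6))
combined <- Reduce(cbind, dfs)
dim(combined)  # 2 3
```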
We need to remove the “.bam” suffixes from the column names using the gsub() function (global substitution). Note that the $ symbol means: end of the line.
colnames(counts.table) <- gsub(".bam$", "", colnames(counts.table))
head(counts.table, n=3)
So the gene names are preserved in the output, we will copy the row names to a new column and save the table as a tab-separated file.
counts.table <- cbind(Gene=rownames(counts.table), counts.table)
write.table(counts.table, "combined_counts.txt", sep="\t", row.names=F, quote=F)
You now have a counts table.
To make the metadata file, open up Microsoft Excel and create a file as follows:
Then choose Save As > Tab delimited Text (.txt) and save the file to a location you can find. You now have a metadata table.
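If you prefer to stay in R, a sketch of building an equivalent metadata table is below; the sample and group names are assumed to match this project’s six samples.

```r
# Assumed layout: one row per sample, with its experimental group.
metadata <- data.frame(Sample = c("A.1", "A.2", "A.3", "B.1", "B.2", "B.3"),
                       Group  = rep(c("A", "B"), each = 3))
write.table(metadata, "pairwise_metadata.txt", sep = "\t",
            row.names = FALSE, quote = FALSE)
```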
Start a new R script and save it to your computer as “edgeR.R”. Execute the edgeR commands one-at-a-time.
ABOUT: These data are from an RNA-seq experiment with two experimental groups, A and B. There are three replicates per group.
library(edgeR)
data <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_counts.txt", row.names=1)
dim(data)
## [1] 23420 6
head(data, n=3)
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_metadata.txt", row.names=1)
metadata
Check the data again to make sure we didn’t drop any columns!
data <- data[, rownames(metadata)]
dim(data)
## [1] 23420 6
head(data, n=3)
data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 14193 6
genes <- DGEList(counts=data_subset, group=metadata[, 1])
genes
## An object of class "DGEList"
## $counts
## A.1 A.2 A.3 B.1 B.2 B.3
## Rp1 7 18 28 18 8 9
## Sox17 229 451 234 367 456 436
## Mrpl15 80 130 104 157 231 207
## Lypla1 184 215 189 292 344 377
## Tcea1 108 192 148 175 208 219
## 14188 more rows ...
##
## $samples
## group lib.size norm.factors
## A.1 A 3326193 1
## A.2 A 5032893 1
## A.3 A 5352456 1
## B.1 B 4979670 1
## B.2 B 5620129 1
## B.3 B 5886895 1
genes <- calcNormFactors(genes)
genes
## An object of class "DGEList"
## $counts
## A.1 A.2 A.3 B.1 B.2 B.3
## Rp1 7 18 28 18 8 9
## Sox17 229 451 234 367 456 436
## Mrpl15 80 130 104 157 231 207
## Lypla1 184 215 189 292 344 377
## Tcea1 108 192 148 175 208 219
## 14188 more rows ...
##
## $samples
## group lib.size norm.factors
## A.1 A 3326193 1.0236147
## A.2 A 5032893 1.0200193
## A.3 A 5352456 0.8468997
## B.1 B 4979670 1.1083495
## B.2 B 5620129 0.9269172
## B.3 B 5886895 1.1007924
genes <- estimateDisp(genes)
genes
## An object of class "DGEList"
## $counts
## A.1 A.2 A.3 B.1 B.2 B.3
## Rp1 7 18 28 18 8 9
## Sox17 229 451 234 367 456 436
## Mrpl15 80 130 104 157 231 207
## Lypla1 184 215 189 292 344 377
## Tcea1 108 192 148 175 208 219
## 14188 more rows ...
##
## $samples
## group lib.size norm.factors
## A.1 A 3326193 1.0236147
## A.2 A 5032893 1.0200193
## A.3 A 5352456 0.8468997
## B.1 B 4979670 1.1083495
## B.2 B 5620129 0.9269172
## B.3 B 5886895 1.1007924
##
## $common.dispersion
## [1] 0.07726973
##
## $trended.dispersion
## [1] 0.15541353 0.05724855 0.06490358 0.05832588 0.06245742
## 14188 more elements ...
##
## $tagwise.dispersion
## [1] 0.17099271 0.05072614 0.04405357 0.03575971 0.03376216
## 14188 more elements ...
##
## $AveLogCPM
## [1] 1.746894 6.165060 4.900612 5.724314 5.127011
## 14188 more elements ...
##
## $trend.method
## [1] "locfit"
##
## $prior.df
## [1] 4.134752
##
## $prior.n
## [1] 1.033688
##
## $span
## [1] 0.2945153
sqrt(genes$common.dispersion)
## [1] 0.2779743
plotBCV(genes)
NOTE: logFC will be computed as B/A, so positive means higher in B. edgeR’s exactTest() computes the comparison as the second group over the first.
stats <- exactTest(genes)
stats
## An object of class "DGEExact"
## $table
## logFC logCPM PValue
## Rp1 -0.93760389 1.746894 0.12038547
## Sox17 0.09658678 6.165060 0.72540551
## Mrpl15 0.54242240 4.900612 0.04323394
## Lypla1 0.36738965 5.724314 0.12314337
## Tcea1 0.04750527 5.127011 0.84813699
## 14188 more rows ...
##
## $comparison
## [1] "A" "B"
##
## $genes
## NULL
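The direction comes from the factor level order: the first level is the reference. A toy check (not the edgeR object itself):

```r
g <- factor(c("A", "A", "B", "B"))
levels(g)  # "A" "B" - "A" is first, so logFC is reported as B vs A
```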
We will utilize the Benjamini-Hochberg (BH) method to adjust the p-values for multiple testing.
stats$table$QValue <- p.adjust(stats$table$PValue, method="BH")
head(stats$table, n=3)
Check the number of significant genes (QValue < 0.05)
sum(stats$table$QValue < 0.05)
## [1] 684
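As a small self-contained illustration of what p.adjust() does (toy p-values, not the real results):

```r
p <- c(0.001, 0.01, 0.04, 0.5)
# BH-adjusted values are never smaller than the raw p-values,
# and the ranking of the genes is preserved.
p.adjust(p, method = "BH")
```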
We will utilize the cpm() function (Counts per Million) to normalize the expression data.
norm <- cpm(genes)
head(norm, n=3)
## A.1 A.2 A.3 B.1 B.2 B.3
## Rp1 2.055957 3.506278 6.176934 3.261333 1.535687 1.388835
## Sox17 67.259172 87.851756 51.621518 66.494965 87.534172 67.281361
## Mrpl15 23.496654 25.323122 22.942897 28.446075 44.342969 31.943215
You can use the normalized expression data from the previous exercise. If you are behind or had issues on the previous exercise, the following will load the needed data for this exercise.
norm <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_norm.txt", row.names=1)
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_metadata.txt", row.names=1)
Log-transforming the normalized data is good practice, as gene expression varies over several orders of magnitude. We need to add a pseudo-count (0.1) to avoid taking the log of 0.
norm.log <- log2(norm + 0.1)
We need to transpose the log-scaled normalized data using the t() function: we want to plot the SAMPLES, but the data table is currently set up with GENES as rows.
pca <- prcomp(t(norm.log))
We need to check the proportion of variance for each principal component (PC).
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 58.6293 31.5442 27.0428 24.24291 19.72479 7.086e-14
## Proportion of Variance 0.5598 0.1620 0.1191 0.09571 0.06336 0.000e+00
## Cumulative Proportion 0.5598 0.7218 0.8409 0.93664 1.00000 1.000e+00
We will extract the proportion of variance for each PC from the summary() function, and use the sprintf() function to create labels for the X and Y axes. %.2f is a formatting code to print a floating point number with 2 decimal places, and %% is the formatting code to display a percent symbol.
importance <- summary(pca)$importance[2,]
xlabel <- sprintf("PC1 (%.2f%%)", 100 * importance[1])
ylabel <- sprintf("PC2 (%.2f%%)", 100 * importance[2])
We will use cbind() to merge the principal components data with the metadata for each sample.
pca.data <- cbind(pca$x, metadata, Sample=rownames(metadata))
head(pca.data, n=3)
We need to load the ggplot2 library. We will use geom_point() to display the PCA data, and geom_text() to add sample labels; hjust=0 and vjust=0 make the labels left-justified. The coord_cartesian() call with clip='off' allows the labels to go outside the plot boundary.
library(ggplot2)
ggplot(pca.data, aes(x=PC1, y=PC2, label=Sample, color=Group)) +
geom_point() + geom_text(hjust=0, vjust=0, show.legend=F, color="black") +
coord_cartesian(clip='off') + labs(x=xlabel, y=ylabel)
We first need to format the variance data for use with ggplot2 using the data.frame() function.
importance.df <- data.frame(PC=names(importance), Variance=importance)
ggplot(importance.df, aes(x=PC, y=Variance)) + geom_col() +
labs(x="Principal Component", y="Variance Explained")
We will make a heatmap of z-scored log2 CPM values for all differentially expressed genes (FDR < 0.05). We will specify our own color scheme and limit the ranges of z-scores that are plotted.
You can use the stats and norm.log variables we created earlier. If you are behind or had issues on that exercise, the following will load the needed data for this exercise.
stats <- readRDS(url("https://wd.cri.uic.edu/edgeR/stats.rds"))
norm.log <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_lognorm.txt", row.names=1)
We will use ComplexHeatmap to create the heatmap figure and circlize to generate the colors.
library(ComplexHeatmap)
library(circlize)
We will use the more stringent Q-value that we calculated earlier. Note that the norm.log and stats tables are in the same row order, so this will work.
degs.norm <- norm.log[stats$table$QValue < 0.05,]
dim(degs.norm)
## [1] 684 6
This will account for gene-to-gene differences in baseline expression. We need to transpose twice with the t() function, as the scale() function works on columns.
degs.norm.z <- t(scale(t(degs.norm)))
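The double transpose can be verified on a toy matrix (2 genes x 3 samples):

```r
m <- rbind(g1 = c(1, 2, 3), g2 = c(10, 20, 30))
z <- t(scale(t(m)))  # transpose, z-score the columns, transpose back
rowMeans(z)          # each gene (row) is now centered at 0
```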
This will scale the colors from blue to white to red, with max/min at +/-2 and the middle at 0.
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))
We will create a basic heatmap, turning off the row names and using the row_split and column_split options to add some space between clusters in the data. The draw() function finalizes the image created by Heatmap() and allows us to save the result to the variable ht while also plotting the figure to the display device; this is the ComplexHeatmap way of doing things.
ht <- draw(Heatmap(degs.norm.z, show_row_names=F, name="Z-scored log2 CPM",
col=col_fun, row_split=2, column_split=2))
Notice that sample B.2 is a little bit “different” from the other B samples, with more extreme colors in the heatmap; it was also the most “different” B sample in the PCA plot. Overall, B.2 differs more from the group A samples than the other B samples do, which we can see in both figures. How can we get the genes that correspond to the heatmap?
genes_order <- row_order(ht)
Note that genes_order is a list of 2 vectors. Each vector is the order (indices) of the genes in one row cluster. To get the actual gene names for each cluster, we need to select the row names of degs.norm.z based on the indices in each vector.
cluster1_genes <- rownames(degs.norm.z)[ genes_order[[1]] ]
head(cluster1_genes, n=3)
## [1] "Eif2s3y" "Fbn2" "Rbm20"
Note that gene_clusters will also be a list.
gene_clusters <- sapply(genes_order, function(x) rownames(degs.norm.z)[x])
head(gene_clusters[[1]], n=3)
## [1] "Eif2s3y" "Fbn2" "Rbm20"
head(gene_clusters[[2]], n=3)
## [1] "Ptafr" "Fkbp10" "Il1b"
You can save a really tall heatmap (with gene names turned ON) to a PDF to double-check the gene orders.
pdf("pairwise_long_heatmap.pdf", height=100)
Heatmap(degs.norm.z, show_row_names=T, name="Z-scored log2 CPM",
col=col_fun, row_split=2, column_split=2)
dev.off()
We will read in the data, set up our model, and estimate the dispersion similarly to what we did before. Then we will use glmQLFit() to fit the model and glmQLFTest() to test it.
ABOUT: These data are from an RNA-seq experiment comparing two different disease models (model1 and model2) to a healthy control group, so there are 3 experimental groups in total. There are 4 replicates per group.
It is safe to skip this step if you have already loaded the library in RStudio.
library(edgeR)
data <- read.delim("https://wd.cri.uic.edu/edgeR/anova_counts.txt", row.names=1)
dim(data)
## [1] 24421 12
head(data, n=3)
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/anova_metadata.txt", row.names=1)
head(metadata, n=3)
We will check the data again to make sure we didn’t drop any columns!
data <- data[, rownames(metadata)]
dim(data)
## [1] 24421 12
data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 16713 12
treat <- factor(metadata[, 1])
treat
## [1] Control Control Control Control Model1 Model1 Model1 Model1 Model2
## [10] Model2 Model2 Model2
## Levels: Control Model1 Model2
model <- model.matrix(~treat)
model
## (Intercept) treatModel1 treatModel2
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 1 0
## 6 1 1 0
## 7 1 1 0
## 8 1 1 0
## 9 1 0 1
## 10 1 0 1
## 11 1 0 1
## 12 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$treat
## [1] "contr.treatment"
We will create the edgeR object without setting the groups this time.
genes <- DGEList(counts=data_subset)
genes <- calcNormFactors(genes)
genes <- estimateDisp(genes, model)
sqrt(genes$common.dispersion)
## [1] 0.1719392
plotBCV(genes)
fit <- glmQLFit(genes, model)
qlf <- glmQLFTest(fit, coef=2:3)
qlf$table$QValue <- p.adjust(qlf$table$PValue, method="BH")
head(qlf$table, n=3)
Let’s check the number of significant genes
sum(qlf$table$QValue < 0.05)
## [1] 5710
We will have the cpm() function directly generate log-scaled gene expression this time rather than doing it ourselves.
norm <- cpm(genes, log=T)
Now we subset the norm table based on the differentially expressed genes, using the more stringent Q-value that we calculated earlier. Note that the norm and qlf$table tables are in the same row order, so this will work.
degs.norm <- norm[qlf$table$QValue < 0.05,]
Note that we are using t() to transpose twice because the scale() function works on columns and we need to z-score on rows
degs.norm.z <- t(scale(t(degs.norm)))
Heatmap(degs.norm.z, show_row_names=F, name="One-way ANOVA, Z-scored log2 CPM",
col=col_fun)
Let’s try a contrast where we compare the control group to the average of both treatments. We’ll remake the model matrix without an intercept for this, but using the same data as before.
This time we will make the model without the intercept (add 0 into the formula).
model.alt <- model.matrix(~0 + treat)
model.alt
## treatControl treatModel1 treatModel2
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 0 1 0
## 6 0 1 0
## 7 0 1 0
## 8 0 1 0
## 9 0 0 1
## 10 0 0 1
## 11 0 0 1
## 12 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$treat
## [1] "contr.treatment"
genes.alt <- DGEList(counts=data_subset)
genes.alt <- calcNormFactors(genes.alt)
genes.alt <- estimateDisp(genes.alt, model.alt)
This should be the same as last time: these models are ultimately equivalent in terms of how they fit the data, but differ in how easily we can ask different questions of them.
sqrt(genes.alt$common.dispersion)
## [1] 0.1719392
fit.alt <- glmQLFit(genes.alt, model.alt)
Let’s try a pair-wise comparison between model1 and model2.
comp1 <- makeContrasts(treatModel1 - treatModel2, levels=model.alt)
Let’s try a comparison of control vs average of model1 and model2.
comp2 <- makeContrasts(treatControl - (treatModel1 + treatModel2)/2, levels=model.alt)
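A contrast is just a vector of weights over the model coefficients, and a valid comparison has weights that sum to zero. A hand-rolled sketch of this contrast’s weights (names are illustrative):

```r
w <- c(Control = 1, Model1 = -0.5, Model2 = -0.5)
sum(w)  # 0 - the weights compare control to the average of the treatments
```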
qlf1 <- glmQLFTest(fit.alt, contrast=comp1)
qlf2 <- glmQLFTest(fit.alt, contrast=comp2)
qlf1$table$QValue <- p.adjust(qlf1$table$PValue, method="BH")
qlf2$table$QValue <- p.adjust(qlf2$table$PValue, method="BH")
We can use the same normalized expression that we got above, already in log scale, and subset it based on the DEGs.
degs.norm1 <- norm[qlf1$table$QValue < 0.05,]
degs.norm2 <- norm[qlf2$table$QValue < 0.05,]
How many genes are significant in both cases?
length(intersect(rownames(degs.norm1), rownames(degs.norm2)))
## [1] 1914
degs.norm.z1 <- t(scale(t(degs.norm1)))
degs.norm.z2 <- t(scale(t(degs.norm2)))
Heatmap 1: these are genes where model1 and model2 are different
Heatmap(degs.norm.z1, show_row_names=F, name="Contrast 1, Z-scored log2 CPM",
col=col_fun)
Heatmap 2: these are genes where the average of model1 and model2 is different from control
Heatmap(degs.norm.z2, show_row_names=F, name="Contrast 2, Z-scored log2 CPM",
col=col_fun)
Set up and run a two-way ANOVA-type model with an interaction term.
ABOUT: These are data from a mouse RNA-seq experiment where animals were subjected to a disease model and allowed to recover; there are three disease conditions: control (no disease), peak disease, and recovery. Two different nerve tissues were collected to profile gene expression: the dorsal root ganglion (DRG) and the sciatic nerve. The goal is to look for similarities and differences in the tissue responses to the disease model.
This step is safe to skip if you have not restarted your R session.
library(edgeR)
data <- read.delim("https://wd.cri.uic.edu/edgeR/twofactor_counts.txt", row.names=1)
dim(data)
## [1] 39056 18
head(data, n=3)
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/twofactor_metadata.txt", row.names=1)
head(metadata, n=3)
Check the data dimensions again to make sure we didn’t drop any columns!
data <- data[, rownames(metadata)]
dim(data)
## [1] 39056 18
data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 20564 18
disease <- factor(metadata[, 1])
tissue <- factor(metadata[, 2])
model <- model.matrix(~ disease + tissue + disease:tissue)
genes <- DGEList(counts=data_subset)
genes <- calcNormFactors(genes)
genes <- estimateDisp(genes, model)
sqrt(genes$common.dispersion)
## [1] 0.1542001
plotBCV(genes)
fit <- glmQLFit(genes, model)
Keep in mind which coefficients these correspond to.
qlf.disease <- glmQLFTest(fit, coef=2:3)
qlf.tissue <- glmQLFTest(fit, coef=4)
qlf.inter <- glmQLFTest(fit, coef=5:6)
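As a hedged sketch of why coef=2:3, 4, and 5:6 pick out those effects: a toy model matrix with the same formula shape (made-up factor levels) has one intercept, two disease columns, one tissue column, and two interaction columns.

```r
d   <- factor(rep(c("control", "peak", "recovery"), each = 2))
tis <- factor(rep(c("DRG", "Sciatic"), 3))
toy.model <- model.matrix(~ d + tis + d:tis)
colnames(toy.model)  # intercept, 2 disease, 1 tissue, 2 interaction columns
```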
qlf.disease$table$QValue <- p.adjust(qlf.disease$table$PValue, method="BH")
qlf.tissue$table$QValue <- p.adjust(qlf.tissue$table$PValue, method="BH")
qlf.inter$table$QValue <- p.adjust(qlf.inter$table$PValue, method="BH")
Let’s check the number of significant genes.
sum(qlf.disease$table$QValue < 0.05)
## [1] 758
sum(qlf.tissue$table$QValue < 0.05)
## [1] 13497
sum(qlf.inter$table$QValue < 0.05)
## [1] 2102
We want to find genes affected by the disease, but then evaluate them separately depending on whether they behaved differently or similarly between the tissues. We will use the interaction term to sort based on different vs similar effect.
Let’s first make a PCA plot with the tissue as plot shape and disease as color to help us visualize the effects of tissue and disease.
Note you can skip this step if you already have these libraries loaded from earlier in the day.
library(ggplot2)
library(ComplexHeatmap)
library(circlize)
norm <- cpm(genes, log=T)
pca <- prcomp(t(norm))
pca.data <- cbind(pca$x, metadata, Sample=rownames(metadata))
importance <- summary(pca)$importance[2,]
xlabel <- sprintf("PC1 (%.2f%%)", 100 * importance[1])
ylabel <- sprintf("PC2 (%.2f%%)", 100 * importance[2])
We will plot tissue as plot shape and disease as color. Note that we turn “clipping” off by using the coord_cartesian option; this allows the text labels to flow outside of the plot area.
ggplot(pca.data, aes(x=PC1, y=PC2, label=Sample, color=Disease, shape=Tissue)) +
geom_point() + geom_text(hjust=0, vjust=0, show.legend=F, color="black") +
coord_cartesian(clip='off') + labs(x=xlabel, y=ylabel)
Since these PCA results have 18 PCs, the default alphanumeric sort would give us an odd-looking plot: the X axis would go from PC1 to PC10 and PC11 instead of PC2 and PC3. To fix this, we will use the factor() function with the levels argument to order the PCs as we would expect.
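The sorting problem is easy to see on a small character vector:

```r
pcs <- c("PC1", "PC2", "PC10", "PC11")
sort(pcs)  # "PC1" "PC10" "PC11" "PC2" - alphanumeric, not numeric, order
factor(pcs, levels = pcs)  # keeps the intended PC1, PC2, PC10, PC11 order
```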
importance.pc <- factor(names(importance), levels=names(importance))
importance.df <- data.frame(PC=importance.pc, Variance=importance)
ggplot(importance.df, aes(x=PC, y=Variance)) + geom_col() +
labs(x="Principal Component", y="Variance Explained")
Above, we can see that the strongest effect (along PC1) is due to tissue. This matches the number of DEGs observed, and would be expected as tissues will generally have very different transcriptomic profiles.
We will determine a filtering strategy using the two-way ANOVA statistics, and make heatmaps from each option.
norm.z <- t(scale(t(norm)))
disease.sel <- qlf.disease$table$QValue < 0.05
inter.sel <- qlf.inter$table$QValue < 0.05
Genes are significant for disease AND interaction.
different_effect <- norm.z[disease.sel & inter.sel,]
nrow(different_effect)
## [1] 310
Genes are significant for disease but NOT interaction.
shared_effect <- norm.z[disease.sel & !inter.sel,]
nrow(shared_effect)
## [1] 448
We will use the same color scale as before, with colors from blue to white to red, with max/min at +/-2 and the middle at 0.
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))
We will use the HeatmapAnnotation() method from ComplexHeatmap to add colored bars above the heatmap to show disease and tissue metadata.
col_labels <- HeatmapAnnotation(df=metadata)
We will utilize the column_split parameter to make side-by-side heatmaps for the tissues.
Heatmap(different_effect, column_title="Different effect", name="Z-scored log2 CPM",
col=col_fun, top_annotation=col_labels, show_row_names=F, show_column_names=F,
column_split=tissue)
Heatmap(shared_effect, column_title="Shared effect", name="Z-scored log2 CPM",
col=col_fun, top_annotation=col_labels, show_row_names=F, show_column_names=F,
column_split=tissue)
These look mostly the same, and they don’t show any differences based on disease! That’s because the effect size, and in turn the color scale, is dominated by the tissue differences.
What we really want to see are the disease effects in each tissue separately. We can fix this by z-scoring separately for each tissue.
norm.drg <- norm[, tissue == "DRG"]
norm.sciatic <- norm[, tissue == "Sciatic_nerve"]
norm.drg.z <- t(scale(t(norm.drg)))
norm.sciatic.z <- t(scale(t(norm.sciatic)))
norm.z <- cbind(norm.drg.z, norm.sciatic.z)
norm.z <- norm.z[, colnames(norm)]
Note: this is the same subsetting as above, so don’t be afraid to copy-paste.
different_effect <- norm.z[disease.sel & inter.sel,]
shared_effect <- norm.z[disease.sel & !inter.sel,]
We will utilize the column_split parameter to make side-by-side heatmaps for the tissues.
Heatmap(different_effect, column_title="Different effect", name="Z-scored log2 CPM",
col=col_fun, top_annotation=col_labels, show_row_names=F, show_column_names=F,
column_split=tissue)
Heatmap(shared_effect, column_title="Shared effect", name="Z-scored log2 CPM",
col=col_fun, top_annotation=col_labels, show_row_names=F, show_column_names=F,
column_split=tissue)
The “different effect” heatmap shows a variety of patterns, such as genes that are down-regulated in peak disease in sciatic nerve, followed by an increase to normal in recovery; these same genes increase in the DRG in peak disease, and remain elevated in recovery.
In the “shared effect” heatmap, we see a few different patterns, but they are shared across tissues.
These are some examples of how we would use different filtering criteria and/or normalization strategies to address different questions.
NOTE: The columns of the data and the rows of the metadata are not in the same order for this data set. We will use the same trick to reorder the data within R.
ABOUT: These data are from a 16S amplicon experiment where the same cohort of animals was subject to one of three treatments (A, B, or C) over a 5-day time course. Samples were collected before treatment (Pre), and after 2, 3, and 5 days. The counts table contains taxonomic summaries at the genus level. We’ll run a repeated measures test controlling for animal-to-animal differences and testing for differences between time points and treatments.
data <- read.delim("https://wd.cri.uic.edu/edgeR/16S_counts.txt", row.names=1)
dim(data)
## [1] 218 60
head(data, n=3)
## S001_1_107 S002_2_108
## Bacteria;Verrucomicrobiota;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia 5414 6918
## Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter 1538 0
## Bacteria;Firmicutes;Clostridia;Oscillospirales;Ruminococcaceae;Ruminococcus 9123 5877
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/16S_metadata.txt", row.names=1)
head(metadata, n=3)
Let’s check the distribution of animal IDs over time points to see which animals were subjected to different treatments. Note that some treatments are missing a couple animals.
table(metadata)
## AnimalID
## Group 107 108 109 110 111 113 114 115 116 117 118 120 121 122 123
## Day2.A 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## Day2.B 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0
## Day2.C 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
## Day3.A 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## Day3.B 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0
## Day3.C 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
## Day5.A 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## Day5.B 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0
## Day5.C 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
## Pre 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Let’s make sure the metadata row order matches the data column order
nrow(metadata)
## [1] 55
ncol(data)
## [1] 60
NOTE: the numbers of samples are different; let’s check which samples differ.
colnames(data)[!colnames(data) %in% rownames(metadata)]
## [1] "S016_NC1" "S028_27_119" "S043_42_119" "S048_NC2" "S058_56_119"
The samples absent from the metadata are:
* 3 samples from animal 119 – this animal did not have a Pre sample, so it was excluded from the metadata
* 2 “NC” samples – negative controls, which we want to exclude from the analysis
data <- data[, rownames(metadata)]
dim(data)
## [1] 218 55
We have 55 samples, so a row total of 500 is an average of about 9 counts per sample. We are also most interested in the more abundant taxa; low-abundance taxa are more likely to be data processing artifacts.
data_subset <- data[rowSums(data) > 500,]
dim(data_subset)
## [1] 121 55
treat <- factor(metadata[, 1])
subject <- factor(metadata[, 2])
model <- model.matrix(~ 0 + treat + subject)
taxa <- DGEList(counts=data_subset)
taxa <- calcNormFactors(taxa)
taxa <- estimateDisp(taxa, model)
Check the BCV.
sqrt(taxa$common.dispersion)
## [1] 0.7809027
Fit the model.
fit <- glmQLFit(taxa, model)
First, let’s look at coefficient names in the model.
colnames(model)
## [1] "treatDay2.A" "treatDay2.B" "treatDay2.C" "treatDay3.A" "treatDay3.B"
## [6] "treatDay3.C" "treatDay5.A" "treatDay5.B" "treatDay5.C" "treatPre"
## [11] "subject108" "subject109" "subject110" "subject111" "subject113"
## [16] "subject114" "subject115" "subject116" "subject117" "subject118"
## [21] "subject120" "subject121" "subject122" "subject123"
Let’s run a contrast for Day5.A vs Pre treatment.
contrast <- makeContrasts(treatDay5.A - treatPre, levels=model)
qlf <- glmQLFTest(fit, contrast=contrast)
qlf$table$QValue <- p.adjust(qlf$table$PValue, method="BH")
Note that we will cover batch effect removal again in the next exercise. Here we are removing the batch effect based on the animal ID.
taxa.norm <- cpm(taxa, log=T)
taxa.norm <- removeBatchEffect(taxa.norm, subject)
taxa.signif <- rownames(qlf$table)[qlf$table$QValue < 0.05]
length(taxa.signif)
## [1] 13
norm.signif <- taxa.norm[taxa.signif,]
norm.signif <- t(scale(t(norm.signif)))
First remove the “Other” taxa label at the end (the $ anchors the search to the end of the string), then remove all higher-level taxa names.
rownames(norm.signif) <- gsub(";Other$", "", rownames(norm.signif))
rownames(norm.signif) <- gsub(".*;", "", rownames(norm.signif))
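The two substitutions can be traced on one example taxon string:

```r
x <- "Bacteria;Firmicutes;Clostridia;Ruminococcus;Other"
x <- gsub(";Other$", "", x)  # strip a trailing ";Other", if present
gsub(".*;", "", x)           # greedy .*; removes everything up to the last ';'
                             # leaving "Ruminococcus"
```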
group.labels <- HeatmapAnnotation(Group=treat)
We will force the treatment groups to be together using the column_split option. We will also hide the column names and dendrogram in favor of our annotations, using the column_title and show_column_dend options.
Heatmap(norm.signif, col=col_fun, name="Z-scored log2 CPM", column_split=treat,
top_annotation=group.labels, column_title=NULL, show_column_dend=F)
contrast.sel <- treat == "Day5.A" | treat == "Pre"
norm.signif.subset <- norm.signif[, contrast.sel]
subset.labels <- HeatmapAnnotation(Group=treat[contrast.sel])
Heatmap(norm.signif.subset, col=col_fun, name="Z-scored log2 CPM",
column_split=treat[contrast.sel], top_annotation=subset.labels,
column_title=NULL, show_column_dend=F)
Exercise to try at home: there are a LOT more contrasts we could run on these data. Can you figure out how to do Day3.A vs Pre? How about the average of Day2 vs Pre (how would you do an average of multiple groups in a contrast)?
ABOUT: These data are from an RNA-seq experiment comparing B cells from five mouse genotypes, with three replicates per group. However, the replicates were collected at different times, so there is a batch effect present, especially for replicate 1.
Reload the edgeR library in case you closed RStudio.
library(edgeR)
data <- read.delim("https://wd.cri.uic.edu/edgeR/batch_counts.txt", row.names=1)
dim(data)
## [1] 23366 15
head(data, n=3)
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/batch_metadata.txt", row.names=1)
head(metadata, n=3)
Check the data again to make sure we didn’t drop any columns!
data <- data[, rownames(metadata)]
dim(data)
## [1] 23366 15
data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 14037 15
geno <- factor(metadata[, 1])
batch <- factor(metadata[, 2])
First, let’s try the analysis without the batch correction. We will just model the genotype.
model <- model.matrix(~ geno)
head(model, n=3)
## (Intercept) genoB genoC genoD genoE
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
genes <- DGEList(counts=data_subset)
genes <- calcNormFactors(genes)
genes <- estimateDisp(genes, model)
sqrt(genes$common.dispersion)
## [1] 0.3957431
fit <- glmQLFit(genes, model)
Also do the B-H p-value adjustment.
qlf <- glmQLFTest(fit, coef=2:5)
qlf$table$QValue <- p.adjust(qlf$table$PValue, method="BH")
Check the number of significant genes.
sum(qlf$table$QValue < 0.05)
## [1] 1528
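As a quick aside on what p.adjust(..., method="BH") is doing, here is a tiny worked example with made-up p-values:

```r
# Benjamini-Hochberg: each sorted p-value is multiplied by n/rank, then
# a cumulative minimum (from the largest down) enforces monotonicity.
p <- c(0.01, 0.02, 0.03, 0.04)
q <- p.adjust(p, method = "BH")
q  # all 0.04 here: 0.01*4/1, 0.02*4/2, 0.03*4/3, 0.04*4/4 all equal 0.04
```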
norm <- cpm(genes, log=T)
pca <- prcomp(t(norm))
importance <- summary(pca)$importance[2,]
xlabel <- sprintf("PC1 (%.2f%%)", 100 * importance[1])
ylabel <- sprintf("PC2 (%.2f%%)", 100 * importance[2])
We will merge the PCA results with the metadata.
pca.data <- cbind(pca$x, metadata, Sample=rownames(metadata))
We will set the shape to be based on the batch and the color based on
the genotype. Note that we have to force Batch to be seen
as a factor instead of a continuous variable by using the as.factor()
function. We will need to adjust its label in the legend to make up for
this.
ggplot(pca.data, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=as.factor(Batch), color=Genotype)) +
geom_text(aes(label=Sample), hjust=0, vjust=0, show.legend=F) +
coord_cartesian(clip='off') + labs(x=xlabel, y=ylabel, shape="Batch")
Note that all replicate 1 samples are to the left, and other samples are to the right. This separation is along PC1, which is capturing the largest portion of the variance. So, there is a batch effect, and it’s pretty strong.
The new model matrix with batch factor and no interaction term
batch.model <- model.matrix(~ geno + batch)
head(batch.model, n=3)
## (Intercept) genoB genoC genoD genoE batch2 batch3
## 1 1 0 0 0 0 0 0
## 2 1 0 0 0 0 1 0
## 3 1 0 0 0 0 0 1
batch.genes <- DGEList(counts=data_subset)
batch.genes <- calcNormFactors(batch.genes)
batch.genes <- estimateDisp(batch.genes, batch.model)
This should be much smaller!
sqrt(batch.genes$common.dispersion)
## [1] 0.2762293
batch.fit <- glmQLFit(batch.genes, batch.model)
This test looks the same as before, the batch terms are just “along for the ride”. Also do the B-H p-value adjustment.
batch.qlf <- glmQLFTest(batch.fit, coef=2:5)
batch.qlf$table$QValue <- p.adjust(batch.qlf$table$PValue, method="BH")
Check the number of significant genes. There should be many more!
sum(batch.qlf$table$QValue < 0.05)
## [1] 2564
We just need to give the batch factor vector. We can start with the same normalized expression as above.
batch.norm <- removeBatchEffect(norm, batch)
batch.pca <- prcomp(t(batch.norm))
batch.importance <- summary(batch.pca)$importance[2,]
batch.xlabel <- sprintf("PC1 (%.2f%%)", 100 * batch.importance[1])
batch.ylabel <- sprintf("PC2 (%.2f%%)", 100 * batch.importance[2])
batch.pca.data <- cbind(batch.pca$x, metadata, Sample=rownames(metadata))
ggplot(batch.pca.data, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=as.factor(Batch), color=Genotype)) +
geom_text(aes(label=Sample), hjust=0, vjust=0, show.legend=F) +
coord_cartesian(clip='off') +
labs(x=batch.xlabel, y=batch.ylabel, shape="Batch")
This looks better, the separation based on batch is gone.
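For intuition about what removeBatchEffect() did, here is a minimal base-R sketch on simulated data. It centers each batch per gene, which is the simplest form of the linear-model correction limma performs (limma's version additionally protects covariates of interest).

```r
# Minimal sketch of batch removal on toy data: subtract each batch's
# per-gene mean from its samples. Not the limma implementation.
set.seed(1)
expr <- matrix(rnorm(4 * 6), nrow = 4)           # 4 genes x 6 samples
batch <- factor(c(1, 1, 2, 2, 3, 3))
expr[, batch == "2"] <- expr[, batch == "2"] + 5 # simulated batch shift
corrected <- expr
for (b in levels(batch)) {
  sel <- batch == b
  corrected[, sel] <- expr[, sel] - rowMeans(expr[, sel, drop = FALSE])
}
# after correction, every gene's mean within each batch is zero
max(abs(sapply(levels(batch), function(b)
  rowMeans(corrected[, batch == b, drop = FALSE]))))  # ~0
```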
ABOUT: This is a clinical data set (46 samples), where some patients had a disease and others did not. The patients were also scored according to a separate phenotype. We are looking for genes that are different according to the disease, or the phenotype, or based on a combination (interaction) between these.
data <- read.delim("https://wd.cri.uic.edu/edgeR/cont_counts.txt", row.names=1)
dim(data)
## [1] 57230 46
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/cont_metadata.txt", row.names=1)
head(metadata, n=3)
Check the data again to make sure we didn’t drop any columns!
data <- data[, rownames(metadata)]
dim(data)
## [1] 57230 46
data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 19827 46
We will split the scores into blocks of 20 for plotting purposes using the seq() function.
breaks <- seq(0, max(metadata$Phenotype_Score) + 20, 20)
ggplot(metadata, aes(x=Phenotype_Score)) +
geom_histogram(breaks=breaks, fill="white", col="black")
ggplot(metadata, aes(x=Disease, y=Phenotype_Score)) +
geom_boxplot()
kruskal.test(Phenotype_Score ~ Disease, data=metadata)
##
## Kruskal-Wallis rank sum test
##
## data: Phenotype_Score by Disease
## Kruskal-Wallis chi-squared = 0.27836, df = 1, p-value = 0.5978
This looks OK: if there was a strong association (small p-value), then the variables would be confounded and we couldn’t test them independently. So we can continue:
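To see what this check would catch, here is a toy example with made-up scores: interleaved scores give a large p-value (no association), while scores that separate cleanly by group give a small one (confounded).

```r
# Toy confounding check (made-up numbers, not the study data).
# Scores interleaved across groups: no association, large p-value.
ok <- kruskal.test(list(c(1, 3, 5, 7), c(2, 4, 6, 8)))
ok$p.value > 0.05   # TRUE
# Scores completely separated by group: confounded, small p-value.
bad <- kruskal.test(list(c(1, 2, 3, 4), c(10, 11, 12, 13)))
bad$p.value < 0.05  # TRUE
```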
We will use the as.numeric() function for the score to ensure it is seen as a number instead of a factor.
score <- as.numeric(metadata[, 1])
disease <- factor(metadata[, 2])
model <- model.matrix(~ score + disease + score:disease)
head(model, n=3)
## (Intercept) score diseaseTRUE score:diseaseTRUE
## 1 1 82 0 0
## 2 1 90 0 0
## 3 1 65 1 65
This may take a minute - there are a lot of samples!
genes <- DGEList(counts=data_subset)
genes <- calcNormFactors(genes)
genes <- estimateDisp(genes, model)
Note the BCV is much higher in this data set. Clinical data sets are often highly variable!
sqrt(genes$common.dispersion)
## [1] 2.027726
fit <- glmQLFit(genes, model)
Let’s take a look at the coefficients to get a sense of directionality for each gene.
head(fit$coefficients, n=3)
## (Intercept) score diseaseTRUE score:diseaseTRUE
## ENSG00000000419 -10.931903 0.006592038 0.8829765 -0.013948637
## ENSG00000000457 -13.949042 0.010968106 3.7684653 -0.076915346
## ENSG00000000460 -9.714775 -0.049499178 -1.9882175 0.006747026
A continuous variable always takes up just one model term, so each of these tests uses a single coefficient.
qlf.score <- glmQLFTest(fit, coef=2)
qlf.disease <- glmQLFTest(fit, coef=3)
qlf.inter <- glmQLFTest(fit, coef=4)
qlf.score$table$QValue <- p.adjust(qlf.score$table$PValue, method="BH")
qlf.disease$table$QValue <- p.adjust(qlf.disease$table$PValue, method="BH")
qlf.inter$table$QValue <- p.adjust(qlf.inter$table$PValue, method="BH")
Check the number of significant genes for each
sum(qlf.score$table$QValue < 0.05)
## [1] 1964
sum(qlf.disease$table$QValue < 0.05)
## [1] 1114
sum(qlf.inter$table$QValue < 0.05)
## [1] 847
This uses data associated with the one-way ANOVA (Exercises 1.5 and 1.6). But we will run pairwise comparisons between all pairs of the factor levels.
# read in, reconcile, and filter data as usual
data <- read.delim("https://wd.cri.uic.edu/edgeR/anova_counts.txt", row.names=1)
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/anova_metadata.txt", row.names=1)
data <- data[, rownames(metadata)]
data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 16713 12
treat <- factor(metadata[, 1])
# start a new data frame to store the results in
pw_stats <- data.frame(Gene = rownames(data_subset))
# get a list of the levels, and the number
levs <- levels(treat)
nlevs <- length(levs)
# loop over all pairs of levels
for(i in 1:(nlevs-1)){
for(j in (i + 1):nlevs){
# names for these groups
A <- levs[i]
B <- levs[j]
# get the logical vectors for these groups
groupA <- A == treat
groupB <- B == treat
# subset the data for groupA or groupB data
pw_metadata <- treat[as.logical(groupA + groupB)]
pw_counts <- data_subset[, as.logical(groupA + groupB)]
# run our edgeR commands
pw_genes <- DGEList(counts=pw_counts, group=pw_metadata)
pw_genes <- calcNormFactors(pw_genes)
pw_genes <- estimateDisp(pw_genes)
pw_test <- exactTest(pw_genes)
pw_test$table$QValue <- p.adjust(pw_test$table$PValue, method="BH")
# add the name of this comparison to the column names
# edgeR does the comparisons as groupB / groupA, so
# we list groupB first
pw_name <- paste(B, A, sep="/")
colnames(pw_test$table) = paste(pw_name, ":", colnames(pw_test$table))
# add results to the pairwise list
pw_stats <- cbind(pw_stats, pw_test$table)
}
}
## Using classic mode.
## Using classic mode.
## Using classic mode.
head(pw_stats, n=3)
NOTE: you may find it useful to remake the above code as a function, which will allow you to rerun it again more easily on a new data set. For example:
pw_tests <- function(counts, groups){
# start a new data frame to store the results in
pw_stats <- data.frame(Gene = rownames(counts))
# get a list of the levels, and the number
levs <- levels(groups)
nlevs <- length(levs)
# loop over all pairs of levels
for (i in 1:(nlevs - 1)) {
for (j in (i + 1):nlevs) {
# names for these groups
A <- levs[i]
B <- levs[j]
# get the logical vectors for these groups
groupA <- A == groups
groupB <- B == groups
# subset the data for groupA or groupB data
pw_metadata <- groups[as.logical(groupA + groupB)]
pw_counts <- counts[, as.logical(groupA + groupB)]
# run our edgeR commands
pw_genes <- DGEList(counts = pw_counts, group = pw_metadata)
pw_genes <- calcNormFactors(pw_genes)
pw_genes <- estimateDisp(pw_genes)
pw_test <- exactTest(pw_genes)
pw_test$table$QValue <- p.adjust(pw_test$table$PValue, method = "BH")
# add the name of this comparison to the column names
# edgeR does the comparisons as groupB / groupA, so
# we list groupB first
pw_name <- paste(B, A, sep = "/")
colnames(pw_test$table) = paste(pw_name, ":", colnames(pw_test$table))
# add results to the pairwise list
pw_stats <- cbind(pw_stats, pw_test$table)
}
}
return(pw_stats)
}
To run the function:
all_pairwise <- pw_tests(data_subset, treat)
head(all_pairwise, n=3)
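As an aside, base R's combn() enumerates the same unordered pairs as the nested i/j loops, which can simplify the bookkeeping; a small illustration with made-up level names:

```r
# The nested i/j loops above visit every unordered pair of levels.
# combn() produces the same pairs, one per column.
pairs <- combn(c("A", "B", "C", "D"), 2)
pairs
ncol(pairs)  # 6 pairs for 4 levels: choose(4, 2)
pairs[, 1]   # first pair: "A" "B"
```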
PLEASE AVOID NEEDING TO DO THIS: you should always have biological replicates. But if you absolutely must analyze the data without replicates, at least do it by setting a fixed dispersion and computing approximate p-values, rather than simply looking at the fold-change. Using a fold-change threshold will bias you towards low-expressed genes. Note that the significance level of the q-values from this analysis is very dependent on the specific BCV you choose. However, it is fair to use the q-value to prioritize the genes based on which appear to be most strongly affected.
data <- read.delim("https://wd.cri.uic.edu/edgeR/norep_counts.txt", row.names=1)
dim(data)
## [1] 24421 2
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/norep_metadata.txt", row.names=1)
metadata
Check the data again to make sure we didn’t drop any columns!
data <- data[, rownames(metadata)]
dim(data)
## [1] 24421 2
data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 14947 2
genes <- DGEList(counts=data_subset, group=metadata[, 1])
genes <- calcNormFactors(genes)
Since we don’t have replicate samples to estimate the dispersion, we will just pick a reasonable value. Let’s assume a BCV of 0.2 (20%).
bcv <- 0.2
Here we pass the square of the assumed BCV via the
dispersion option. Also run the B-H FDR correction.
stats <- exactTest(genes, dispersion=bcv^2)
stats$table$QValue <- p.adjust(stats$table$PValue, method="BH")
head(stats$table[order(stats$table$QValue),], n=3)
Use the biomaRt package from Bioconductor. If this is the first time using biomaRt, we will need to install it with BiocManager.
if (!require("BiocManager", quietly=T))
install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)
data <- read.delim("https://wd.cri.uic.edu/edgeR/mouse_ensembl_rnaseq.txt", row.names=1)
head(data, n=3)
Note that the identifiers are Ensembl gene IDs: the ENS prefix marks
Ensembl, the three-letter species code (MUS) marks mouse (if no
three-letter code is shown, they’re human IDs), the G marks a gene,
and the trailing .[number] is a version suffix. We will need to remove
the versions before looking up symbols. This is also necessary if
you’re going to upload to a pathway database. We will remove them
using a REGULAR EXPRESSION (REGEX); we have seen a little of this also
in the grep and sed commands in Linux. Very powerful!
Note that \\. matches a literal period and [0-9]*
matches zero or more digits.
head(rownames(data), n=3)
## [1] "ENSMUSG00000051951.5" "ENSMUSG00000102851.1" "ENSMUSG00000103147.1"
newnames <- gsub("\\.[0-9]*", "", rownames(data))
head(newnames, n=3)
## [1] "ENSMUSG00000051951" "ENSMUSG00000102851" "ENSMUSG00000103147"
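The backslashes in the pattern matter. A quick demonstration of why:

```r
# In a regex, an unescaped "." matches ANY character, so without the
# backslashes the pattern would consume the entire ID.
id <- "ENSMUSG00000051951.5"
gsub("\\.[0-9]*", "", id)  # "ENSMUSG00000051951" - literal period
gsub(".[0-9]*", "", id)    # "" - every character is matched and removed
```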
Now convert the IDs to gene symbols. Check the manual for the
biomaRt package for more details on the available databases and attributes.
For today, we already know the database
(mmusculus_gene_ensembl) and attributes we want
(ensembl_gene_id and gene symbol AKA
external_gene_name). Note that both of these commands may
take some time to run, since they are fetching data from an online
database hosted by Ensembl.
mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl"))
gene_list <- getBM(filters="ensembl_gene_id",
attributes=c("ensembl_gene_id", "external_gene_name"),
values=newnames, mart=mart)
# If the Ensembl server is unavailable, a saved copy of the results can be loaded instead:
# gene_list <- read.delim("gene_list.txt")
head(gene_list, n=3)
length(newnames)
## [1] 39056
nrow(gene_list)
## [1] 38535
length(unique(gene_list[, 1]))
## [1] 38535
length(unique(gene_list[, 2]))
## [1] 38492
Now that you have the list, you can combine with an existing data
frame. In the example below, we combine with the counts table using
merge().
NOTE: usually you would want to keep the original IDs as the main identifiers throughout your analysis in edgeR, and then merge the symbols or other information with your normalized expression (CPM) or differential stats tables AFTER running the analysis. You can’t have a mix of symbols and other strings in your counts table when processing through edgeR.
rownames(data) = newnames
Keep all entries by using the all option. Unmatched IDs
will have NA values for the gene symbol.
data.named <- merge(data, gene_list, by.x="row.names", by.y="ensembl_gene_id", all=T)
head(data.named, n=3)
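When you just need symbols in the same row order as an existing table, match() is a common alternative to merge(); here is a toy sketch with made-up IDs and symbols (not the real gene_list):

```r
# Toy alternative to merge(): look up symbols with match(), keeping
# the original row order. IDs and symbols here are made up.
lookup <- data.frame(ensembl_gene_id    = c("ENSMUSG01", "ENSMUSG02"),
                     external_gene_name = c("Xkr4", "Rp1"))
ids <- c("ENSMUSG02", "ENSMUSG01", "ENSMUSG03")
symbols <- lookup$external_gene_name[match(ids, lookup$ensembl_gene_id)]
symbols  # "Rp1" "Xkr4" NA - unmatched IDs become NA
```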
(Start from the end of Exercise 2.2 to lead into this exercise.)
There’s another way we could consider filtering for genes with a significant interaction: just filter directly on the interaction term, ignoring the independent effects of disease. That is:
any_inter <- norm.z[inter.sel,]
Or, we could separate just the significant interactions from those that also have independent disease effects:
inter_only <- norm.z[inter.sel & !disease.sel,]
Try a heatmap with the second option:
Heatmap(inter_only, col=col_fun, name="Z-scored log2 CPM",
top_annotation=col_labels, show_row_names=F, show_column_names=F,
column_split = tissue)
It looks like we have a clear effect in sciatic nerve, but the
picture in DRG is murkier. Since norm.z was z-scored
separately for each tissue, it’s a little hard to assess the relative
effect in DRG from this plot.
We will try running one-way ANOVA tests separately in each tissue.
samples.drg <- tissue=="DRG"
samples.sn <- tissue=="Sciatic_nerve"
data.drg <- data_subset[, samples.drg]
data.sn <- data_subset[, samples.sn]
disease.drg <- disease[samples.drg]
disease.sn <- disease[samples.sn]
model.drg <- model.matrix(~disease.drg)
model.sn <- model.matrix(~disease.sn)
genes.drg <- DGEList(data.drg)
genes.sn <- DGEList(data.sn)
genes.drg <- calcNormFactors(genes.drg)
genes.sn <- calcNormFactors(genes.sn)
genes.drg <- estimateDisp(genes.drg, model.drg)
genes.sn <- estimateDisp(genes.sn, model.sn)
fit.drg <- glmQLFit(genes.drg, model.drg)
fit.sn <- glmQLFit(genes.sn, model.sn)
test.drg <- glmQLFTest(fit.drg, coef=2:3)
test.sn <- glmQLFTest(fit.sn, coef=2:3)
test.drg$table$QValue <- p.adjust(test.drg$table$PValue, method="BH")
test.sn$table$QValue <- p.adjust(test.sn$table$PValue, method="BH")
# count the number of DEGs
sum(test.drg$table$QValue < 0.05)
## [1] 921
sum(test.sn$table$QValue < 0.05)
## [1] 3684
# which of these genes were detected before in the "disease" factor?
sum(test.drg$table$QValue < 0.05 & disease.sel)
## [1] 548
sum(test.sn$table$QValue < 0.05 & disease.sel)
## [1] 365
It does look like the general disease effect is stronger in sciatic nerve. Remake the heatmap, with a row annotation indicating the significance for each tissue separately.
# prepare annotations for the heatmap based on -log10 FDR
signif.drg <- -log10(test.drg$table$QValue)
signif.sn <- -log10(test.sn$table$QValue)
labels.drg <- signif.drg[!disease.sel & inter.sel]
labels.sn <- signif.sn[!disease.sel & inter.sel]
# set an alternate color scale for significance levels
col_signif <- colorRamp2(c(0, -log10(0.05), 5), c("white", "red", "yellow"))
# use the same color scale for both annotations
anova.label <- rowAnnotation(DRG=labels.drg, SN=labels.sn,
col=list(DRG=col_signif, SN=col_signif))
Heatmap(inter_only, col=col_fun, name="Z-scored log2 CPM",
top_annotation=col_labels, show_row_names=F, show_column_names=F,
column_split = tissue, left_annotation = anova.label)
All of these show a significant effect in sciatic nerve, whereas only some of them show it in DRG.
For this exercise, we’ll use a shotgun metagenomics data set, with taxonomic quantification summarized at the phylum level. These data are from a mouse study testing a surgical injury model and a treatment for the injury. We have two experimental factors: (1) Injury, with levels Naive, Sham, and Surgery; and (2) Treatment, with levels treated and control.
library(edgeR)
library(ggplot2)
Observe that the features are taxonomic names, down to phylum level. All are for bacteria.
data <- read.delim("https://wd.cri.uic.edu/edgeR/metagenomics_counts.txt", row.names=1)
head(rownames(data), n=3)
## [1] "sk_Bacteria;k__Bacteria incertae sedis;p__Caldiserica"
## [2] "sk_Bacteria;k__Bacteria incertae sedis;p__Candidatus Levybacteria"
## [3] "sk_Bacteria;k__Bacteria incertae sedis;p__Candidatus Magasanikbacteria"
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/metagenomics_metadata.txt",
row.names=1)
rownames(data) <- gsub("^.*;p__", "", rownames(data))
head(rownames(data), n=3)
## [1] "Caldiserica" "Candidatus Levybacteria"
## [3] "Candidatus Magasanikbacteria"
data <- data[, rownames(metadata)]
data_subset <- data[rowSums(data) > 20,]
You can also run differential stats, but you should be an expert on that by now, so we omit those steps here.
taxa <- DGEList(counts = data_subset)
taxa <- calcNormFactors(taxa)
taxa.norm <- cpm(taxa, log=T)
pca <- prcomp(t(taxa.norm))
pca.data <- cbind(pca$x, metadata)
ggplot(pca.data, aes(x=PC1, y=PC2, color=Injury)) + geom_point()
ggplot(pca.data, aes(x=PC1, y=PC2, color=Treatment)) + geom_point()
ggplot(pca.data, aes(x=PC1, y=PC2, color=Treatment)) + geom_point() +
facet_grid(~Injury)
ggplot(pca.data, aes(x=PC1, y=PC2, color=Treatment)) + geom_point() +
facet_grid(~Injury) + stat_ellipse()
The Firmicutes/Bacteroidetes ratio is a common measure of the health of a gut microbiome. We can add that value to the PCA plot as the dot colors.
pca.data.expanded <- cbind(pca.data,
Bacteroidetes = t(data["Bacteroidetes",]),
Firmicutes = t(data["Firmicutes",]))
ggplot(pca.data.expanded, aes(x=PC1, y=PC2, color=Firmicutes / Bacteroidetes)) +
geom_point()
ggplot(pca.data.expanded, aes(x=PC1, y=PC2, color=log2(Firmicutes / Bacteroidetes))) +
geom_point()
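Why log-transform the ratio? On the log2 scale, enrichment and depletion of the same magnitude are symmetric around zero, so the color scale treats them evenly:

```r
# A 2-fold enrichment and a 2-fold depletion become +1 and -1 on the
# log2 scale, instead of the asymmetric 2 and 0.5 on the raw scale.
log2(2 / 1)  # 1
log2(1 / 2)  # -1
log2(1)      # 0: a balanced ratio
```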
This allows us to visualize the contribution of each feature to the principal components. Note that you may want to be careful with this plot if you have a huge number of features, but with a relatively small number of taxa it will work.
First, get the rotation matrix as a data frame.
rotation <- data.frame(taxa=rownames(pca$rotation), pca$rotation)
Next determine a scalar to multiply the vector lengths by so that the scales line up correctly.
mult.pc1 <- (max(pca.data[, "PC1"]) - min(pca.data[, "PC1"])) /
(max(rotation[, "PC1"]) - min(rotation[, "PC1"]))
mult.pc2 <- (max(pca.data[, "PC2"]) - min(pca.data[, "PC2"])) /
(max(rotation[, "PC2"]) - min(rotation[, "PC2"]))
mult <- min(mult.pc1, mult.pc2)
rotation$PC1 = mult * rotation$PC1
rotation$PC2 = mult * rotation$PC2
Start creating the plot of the PCA data colored by treatment. We will separate this into multiple lines to demonstrate how you can add plot components together.
plot <- ggplot(pca.data, aes(x=PC1, y=PC2, color=Treatment)) + geom_point()
Add vertical and horizontal lines for the middle of the biplot using geom_hline() and geom_vline() methods.
plot <- plot + geom_hline(yintercept=0) + geom_vline(xintercept=0)
Add text labels for each taxon. Set clip='off' to let text labels go outside the plot area.
plot <- plot + geom_text(data=rotation, aes(x=PC1, y=PC2, label=taxa),
size=3, hjust=0, vjust=0, color="red") + coord_equal(clip='off')
Add arrows for the biplot: from the origin (0,0) to the PC coordinate in the rotation matrix.
plot <- plot + geom_segment(data=rotation, aes(x=0, y=0, xend=PC1, yend=PC2),
arrow=arrow(length=unit(0.2, "cm")), color="red")
Finally, print out the plot.
plot